13 Unified compilation of Fortran 77D and 90D
Alok Choudhary, Geoffrey Fox, Seema Hiranandani, Ken Kennedy, Charles Koelbel, Sanjay
Ranka, Chau-Wen Tseng

March 1993 ACM Letters on Programming Languages and Systems (LOPLAS), Volume 2

Issue 1-4 
Publisher: ACM Press 
We present a unified approach to compiling Fortran 77D and Fortran 90D programs for
efficient execution on MIMD distributed-memory machines. The integrated Fortran D
compiler relies on two key observations. First, array constructs may be scalarized into 
FORALL loops without loss of information. Second, loop fusion, partitioning, and sectioning 
optimizations are essential for both Fortran D dialects. 
In this work, we propose a SoC power estimation framework built on our system-level
simulation environment. Our framework provides designers with the system-level power 
profile in a cycle-accurate manner. We target the framework to run fast and accurately, 
which is enabled by adopting different modeling techniques depending on the power 
characteristics of various IP blocks. The framework can be applied to any target SoC 
5 Energy-aware design of embedded memories: A survey of technologies,
Luca Benini, Alberto Macii, Massimo Poncino

February 2003 ACM Transactions on Embedded Computing Systems (TECS), Volume 2 

Issue 1 
Publisher: ACM Press 

Full text available: pdf(288.44 KB) Additional Information: full citation, abstract, references, index terms

Embedded systems are often designed under stringent energy consumption budgets, to 
limit heat generation and battery size. Since memory systems consume a significant 
amount of energy to store and to forward data, it is then imperative to balance power 
consumption and performance in memory system design. Contemporary system design 
focuses on the trade-off between performance and energy consumption in processing and 
storage units, as well as in their interconnections. Although memory design is as ... 
Designing ASICs for each new generation of backbone routers is a time intensive and
fiscally draining process. In this paper we focus on the design of a programmable 
architecture for backbone routers, based on the manipulation of wide irregular memory 
words, that can provide a feasible design alternative to custom ASICs. We propose a 
pipelined memory design that emphasizes worst-case throughput over latency, and co- 
explore architectural tradeoffs with the design of several important network algo ... 
Dan Hillman

March 2005 Proceedings of the conference on Design, Automation and Test in Europe
- Volume 3 DATE '05
Full text available: pdf(149.06 KB) Additional Information: full citation, abstract, index terms

At 130 nm and 90 nm, power consumption (both dynamic and static) has become a 
barrier in the roadmap for SoC designs targeting battery powered, mobile applications. 
This paper presents the results of dynamic and static power reduction achieved 
implementing Tensilica's 32-bit Xtensa microprocessor core, using Virtual Silicon's Power 
Management IP. Independent voltage islands are created using Virtual Silicon's VIP 
PowerSaver standard cells by using voltage level shifting cells and voltage isolati ... 
Tiled architectures, such as RAW, SmartMemories, TRIPS, and WaveScalar, promise to
address several issues facing conventional processors, including complexity, wire-delay, 
and performance. The basic premise of these architectures is that larger, higher- 
performance implementations can be constructed by replicating the basic tile across the 
chip. This paper explores the area-performance trade-offs when designing one such tiled 
architecture, WaveScalar. We use a synthesizable RTL model and cycle-le ... 

A hardware prefetching mechanism named Speculative Prefetching is proposed. This 
scheme detects vector accesses issued by a load/store instruction and prefetches the 
corresponding data. The scheme requires no software add-on, and in some cases it is 
more powerful than software techniques for identifying regular accesses. The tradeoffs 
related to its hardware implementation are extensively discussed in order to finely tune 
the mechanism. Experiments show that average memory ... 
behavior

Somnath Ghosh, Margaret Martonosi, Sharad Malik 

July 1999 ACM Transactions on Programming Languages and Systems (TOPLAS), 

Volume 21 Issue 4 
Publisher: ACM Press 

Full text available: pdf(548.18 KB) Additional Information: full citation, abstract, references, citings, index terms, review

With the ever-widening performance gap between processors and main memory, cache 
memory, which is used to bridge this gap, is becoming more and more significant. Caches 
work well for programs that exhibit sufficient locality. Other programs, however, have 
reference patterns that fail to exploit the cache, thereby suffering heavily from high 
memory latency. In order to get high cache efficiency and achieve good program 
performance, efficient memory accessing behavior is necessary. In fact, f ... 

Keywords: cache memories, compilation, optimization, program transformation 



11 Architecture and Design of a High Performance SRAM for SoC Design
Shobha Singh, Shamsi Azmi, Nutan Agrawal, Penaka Phani, Ansuman Rout
January 2002 Proceedings of the 2002 conference on Asia South Pacific design 

automation/VLSI Design 
Publisher: IEEE Computer Society 

Full text available: pdf(186.65 KB), Publisher Site
Additional Information: full citation, abstract

Critical issues in designing a high speed, low power static RAM in deep submicron 
technologies are described along with the design techniques used to overcome them. With 
appropriate circuit partitioning, transistor sizing, choice of a suitable Sense Amplifier, a
good resetting technique and judicious use of dual Vth transistors we have achieved a high
speed memory without dissipating too much power. The Introduction gives the 
specifications of the memory that was our design target. In Section II, w ... 

12 Session 11A: embedded tutorial: System and architecture-level power reduction of
microprocessor-based communication and multi-media applications
Lode Nachtergaele, Vivek Tiwari, Nikil Dutt

November 2000 Proceedings of the 2000 IEEE/ACM international conference on 
Computer-aided design 

Publisher: IEEE Press 

Full text available: pdf(34.99 KB) Additional Information: full citation, abstract, references, citings

Current microprocessor architectures become more and more dominated by the data 
access bottlenecks in the cache, system bus and main memory subsystems. These also 
have a major influence on the system (board-level) power consumption. In practice this 
means lower energy consumption for a given throughput requirement. In the booming 
domain of (largely embedded) cost-sensitive communication and multi-media 
applications, more and more implementations make use of microprocessor based 
platforms for flex ... 



13 Security processor design: Power estimation strategies for a low-power security
processor

Yen-Fong Lee, Shi-Yu Huang, Sheng-Yu Hsu, I-Ling Chen, Cheng-Tao Shieh, Jian-Cheng Lin,
Shih-Chieh Chang 

January 2005 Proceedings of the 2005 conference on Asia South Pacific design 
automation ASP-DAC '05 

Publisher: ACM Press 

Full text available: pdf(398.87 KB) Additional Information: full citation, abstract, references

In this paper, we present the power estimation methodologies for the development of a 
low-power security processor that contains significant amount of logic and memory. For 
the logic part, we present a highly accurate tool, called PowerMixer. This tool is a 
refinement of the so-called mixed-level methodology that combines the accuracy of quick 
SPICE and the speed of gate-level simulation. A grouping scheme is proposed so as to 
improve the accuracy for design blocks as large as 100K gates. ... 




14 PARADIGM: a compiler for automatic data distribution on multicomputers 
Manish Gupta, Prithviraj Banerjee 

August 1993 Proceedings of the 7th international conference on Supercomputing 
Publisher: ACM Press 

Full text available: pdf(1.11 MB) Additional Information: full citation, abstract, references, citings, index terms

One of the most challenging steps in developing a parallel program for a distributed 
memory machine is determining how data should be distributed across processors. Most 
of the compilers being developed to make it easier to program such machines still provide 
no assistance to the programmer in this difficult and machine-dependent task. We have 
developed PARADIGM, a compiler that makes data partitioning decisions for Fortran 77 
procedures. A significant feature of the design of PARADIGM is t ... 

15 A general framework for prefetch scheduling in linked data structures and its
application to multi-chain prefetching

Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung
May 2004 ACM Transactions on Computer Systems (TOCS), Volume 22 Issue 2
Publisher: ACM Press 
Pointer-chasing applications tend to traverse composite data structures consisting of
multiple independent pointer chains. While the traversal of any single pointer chain leads 
to the serialization of memory operations, the traversal of independent pointer chains 
provides a source of memory parallelism. This article investigates exploiting such 
interchain memory parallelism for the purpose of memory latency tolerance, using a 
technique called multi-chain prefetching. Previous work ... 

Keywords: Data prefetching, memory parallelism, pointer-chasing code 
Full text available: pdf(967.75 KB) Additional Information: full citation, references, citings, index terms



17 Compiler Managed Dynamic Instruction Placement in a Low-Power Code Cache
Rajiv A. Ravindran, Pracheeti D. Nagarkar, Ganesh S. Dasika, Eric D. Marsman, Robert M. 
Senger, Scott A. Mahlke, Richard B. Brown 

March 2005 Proceedings of the international symposium on Code generation and 
optimization CGO '05 

Publisher: IEEE Computer Society 

Full text available: pdf(631.68 KB) Additional Information: full citation, abstract, index terms

Modern embedded microprocessors use low power on-chip memories called scratch-pad 
memories to store frequently executed instructions and data. Unlike traditional caches, 
scratch-pad memories lack the complex tag checking and comparison logic, thereby 
proving to be efficient in area and power. In this work, we focus on exploiting scratch-pad 
memories for storing hot code segments within an application. Static placement 
techniques focus on placing the most frequently executed portions of programs ... 

18 Online Only: ACM Transactions on Design Automation of Electronic Systems, vol. 11,
issue 3 (Novel Paradigms in System-Level Design): Architecture description
language (ADL)-driven software toolkit generation for architectural exploration of
programmable SOCs
Prabhat Mishra, Aviral Shrivastava, Nikil Dutt 

June 2004 ACM Transactions on Design Automation of Electronic Systems (TODAES) , 
Proceedings of the 41st annual conference on Design automation DAC 
'04, Volume 11 Issue 3 

Publisher: ACM Press 

Full text available: pdf(1.07 MB) Additional Information: full citation, abstract, references, index terms

Advances in semiconductor technology permit increasingly complex applications to be 
realized using programmable systems-on-chips (SOCs). Furthermore, shrinking time-to- 
market demands, coupled with the need for product versioning through software 
modification of SOC platforms, have led to a significant increase in the software content of 
these SOCs. However, designer productivity is greatly hampered by the lack of automated 
software generation tools for the exploration and evaluation of different ... 

Keywords: Architecture description language, design space exploration, embedded 
processor, programmable architecture, retargetable compilation 
19 Compiling for shared-memory and message-passing computers
March 1993 ACM Letters on Programming Languages and Systems (LOPLAS), Volume 2

Issue 1-4 
Publisher: ACM Press 

Full text available: pdf(1.27 MB) Additional Information: full citation, abstract, references, citings, index terms

Many parallel languages presume a shared address space in which any portion of a 
computation can access any datum. Some parallel computers directly support this 
abstraction with hardware shared memory. Other computers provide distinct (per- 
processor) address spaces and communication mechanisms on which software can 
construct a shared address space. Since programmers have difficulty explicitly managing 
address spaces, there is considerable interest in compiler support for shared address 
spac ... 

Keywords: cache coherence, compilers, directory protocols, memory systems, message- 
passing multiprocessors, parallel programming languages, shared-memory 
multiprocessors 
21 Compiler optimizations for Fortran D on MIMD distributed-memory machines
Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng 

August 1991 Proceedings of the 1991 ACM/IEEE conference on Supercomputing 
Publisher: ACM Press 

Full text available: pdf(1.57 MB) Additional Information: full citation, references, citings, index terms



22 Compiler-directed page coloring for multiprocessors 

Edouard Bugnion, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum, Monica S. Lam 
September 1996 ACM SIGPLAN Notices, ACM SIGOPS Operating Systems Review,
Proceedings of the seventh international conference on Architectural
support for programming languages and operating systems ASPLOS-VII,
Volume 31, 30 Issue 9, 5
Publisher: ACM Press 

Full text available: pdf(1.37 MB) Additional Information: full citation, abstract, references, citings, index terms



This paper presents a new technique, compiler-directed page coloring, that eliminates 
conflict misses in multiprocessor applications. It enables applications to make better use 
of the increased aggregate cache size available in a multiprocessor. This technique uses 
the compiler's knowledge of the access patterns of the parallelized applications to direct 
the operating system's virtual memory page mapping strategy. We demonstrate that this 
technique can lead to significant performance impr ... 

Fortran 90D/HPF compiler for distributed memory MIMD computers: design,
implementation, and performance results
Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, S. Ranka

December 1993 Proceedings of the 1993 ACM/IEEE conference on Supercomputing 
Publisher: ACM Press 
In this article, we explore the potential of classical dataflow analysis techniques in
removing overhead in write-invalidate cache coherence protocols for shared-memory 
multiprocessors. We construct the compiler algorithms with varying degree of 
sophistication that detect loads followed by stores to the same address. Such loads are 
marked and constitute a hint to the cache to obtain an exclusive copy of the block so that 
the subsequent store does not introduce access penalties. The simplest ... 

Keywords: cache coherence, dataflow analysis, performance evaluation 



25 Improving data locality with loop transformations 
Kathryn S. McKinley, Steve Carr, Chau-Wen Tseng 
Publisher: ACM Press

Full text available: pdf(411.40 KB) Additional Information: full citation, abstract, references, citings, index terms, review

In the past decade, processor speed has become significantly faster than memory speed. 
Small, fast cache memories are designed to overcome this discrepancy, but they are only 
effective when programs exhibit data locality. In this article, we present compiler
optimizations to improve data locality based on a simple yet accurate cost model. The 
model computes both temporal and spatial reuse of cache lines to find desirable loop 
organizati ... 

Keywords: Cache, compiler optimization, data locality, loop distribution, loop fusion, loop 
permutation, loop reversal, loop transformations, microprocessors, simulation 
We present a unified approach to compiling Fortran 77D and Fortran 90D programs for
efficient execution on MIMD distributed-memory machines. The integrated Fortran D
compiler relies on two key observations. First, array constructs may be scalarized into 
FORALL loops without loss of information. Second, loop fusion, partitioning, and sectioning 
optimizations are essential for both Fortran D dialects. 

Keywords: Fortran D, parallel languages, parallel programming 
August 1993 Proceedings of the 7th international conference on Supercomputing
Publisher: ACM Press 
Mapping data to parallel computers aims at minimizing the execution time of the
associated application. However, it can take an unacceptable amount of time in 
comparison with the execution time of the application if the size of the problem is large. 
In this paper, first we motivate the case for graph contraction as a means for reducing the 
problem size. We restrict our discussion to applications where the problem domain can be 
described using a graph (e.g., computational fluid dynamics appl ... 

30 Architecture-independent scientific programming in data parallel C: three case
studies

Philip J. Hatcher, Michael J. Quinn, Ray J. Anderson, Anthony J. Lapadula, Bradley K. 
Seevers, Andrew F. Bennett 
We believe the absence of massively-parallel, shared-memory machines follows from the
lack of a shared-memory programming performance model that can inform programmers 
of the cost of operations (so they can avoid expensive ones) and can tell hardware 
designers which cases are common (so they can build simple hardware to optimize them). 
Cooperative shared memory, our approach to shared-memory design, addresses this 
problem. Our initial implementation of cooperativ ... 
In the past decade, processor speed has become significantly faster than memory speed.
Small, fast cache memories are designed to overcome this discrepancy, but they are only
effective when programs exhibit data locality. In this paper, we present compiler
optimizations to improve data locality based on a simple yet accurate cost model. The
model computes both temporal and spatial reuse of cache lines to find desirable loop 
organizations. T ... 
November 1998 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
(CDROM) 
We describe an implementation of a sizable subset of OpenMP on networks of
workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome 
one of its primary drawbacks compared to MPI, namely lack of portability to environments 
other than hardware shared memory machines. In order to support OpenMP execution on 
NOWs, our compiler targets a software distributed shared memory system (DSM) which 
provides multi-threaded execution and memory consistency. This paper presents two
contri ... 

This paper presents a new parallelization method for reductions of arrays with subscripted 
subscripts on scalable shared memory multiprocessors. The mapping of computations is 
based on grouping reduction loop iterations into sets that are further assigned to the 
cooperating threads of computation. Iterations belonging to the same set are chosen in
such a way that update different entries in the reduction array. That is, the loop
distribution implies a conflict-free write distribution of the ...

Results (page 2): +"memory compiler" +characterization

http://portal.acm.org/results.cfm?query=%2B%22memory%20compiler%22%20%2Bchar... 10/4/2006

The Fortran D compiler uses data decomposition specifications to automatically translate
Fortran programs for execution on MIMD distributed-memory machines. This paper 
introduces and classifies a number of advanced optimizations needed to achieve 
acceptable performance; they are analyzed and empirically evaluated for stencil 
computations. Profitability formulas are derived for each optimization. Results show that 
exploiting parallelism for pipelined computations, reductions, and scans is vi ... 